Gradient-flow adaptive importance sampling for Bayesian leave one out cross-validation for sigmoidal classification models

We introduce a set of gradient-flow-guided adaptive importance sampling (IS) transformations to stabilize Monte-Carlo approximations of point-wise leave one out cross-validated (LOO) predictions for Bayesian classification models. One can leverage this methodology for assessing model generalizability by for instance computing a LOO analogue to the AIC or computing LOO ROC/PRC curves and derived metrics like the AUROC and AUPRC. By the calculus of variations and gradient flow, we derive two simple nonlinear single-step transformations that utilize gradient information to shift a model’s pre-trained full-data posterior closer to the target LOO posterior predictive distributions. In doing so, the transformations stabilize importance weights. Because the transformations involve the gradient of the likelihood function, the resulting Monte Carlo integral depends on Jacobian determinants with respect to the model Hessian. We derive closed-form exact formulae for these Jacobian determinants in the cases of logistic regression and shallow ReLU-activated artificial neural networks, and provide a simple approximation that sidesteps the need to compute full Hessian matrices and their spectra. We test the methodology on an n≪p dataset that is known to produce unstable LOO IS weights.


INTRODUCTION
In Bayesian formulations of classification problems where one uses a vector of covariates x ∈ R p to estimate the probability of an outcome labeled by y ∈ {0, 1}, one specifies a likelihood function ℓ, where and where p i (θ) ≡ p(θ, x i ) is the predicted outcome probability for observation i, and computes the statistics of the parameter vector θ by applying Bayes rule where π(θ) is the joint prior distribution over the parameter vector θ, corresponding to the likelihood function ℓ(θ).
There are usually many plausible choices for ℓ, even within the same family of models (for example different sets of features in logistic regression) -one needs methodology to evaluate different models to inform decisions such as model selection.
One usually wishes to evaluate how a model will generalize to new data, by estimating its accuracy on data that is both unused during training and statistically independent from the training data.Simulating performance under unseen data requires either fitting the model multiple times (for k-fold cross validation, which can be expensive on large scale datasets, with large-scale models), or removing a subset of the data entirely (in the case of test/train splits).
Other general methods of assessing model generalizability exist, notably information criterion related to the Akaike Information criteria (AIC).The AIC and Bayesian variants [Gelman et al., 2014, Watanabe, 2013, Vehtari et al., 2017, Piironen and Vehtari, 2017a] are asymptotic approximations [Stone, 1977, Watanabe, 2010] of a model's leave one out (LOO) cross validation loss.In their standard usage, these methods approximate the total log LOO likelihood for a given model/dataset, a quantity that does not directly map to ROC/PRC curves.However, similar methods can be used to compute LOO versions of these curve.The receiver operator curve (AUROC) [Hanley and McNeil, 1982] and the area under the precision recall cure (AUPRC) are familiar, interpretable, and often desirable for use in decision making processes.
In this manuscript, we motivate an adaptive importance sampling method for LOO, based on transformations derived by reduction of a given functional criterion by stepping down a gradient flow.These transformations use model-specific gradient information -we derive explicit formulae for the transformations and their Jacobian determinants in the case of common sigmoidal classification models (logistic regression and Bayesian ReLU-activated neural networks).

PRELIMINARIES 2.1 NOTATION
We denote vectors using bold-faced lower case symbols, and matrices using bold-faced uppercase symbols.By convention, all vectors are assumed to be column vectors unless otherwise stated.Given a matrix W = (w ij ), the i-th row is denoted w i, and j-th column is denoted w ,j .
We refer to the entire set of observed training data as As shorthand, we denote the set of training data where the i-th observation is left out as Expectations with respect to the posterior distribution of θ are denoted E θ|D , and with respect to the posterior distribution of θ if observation i is left out are denoted E θ|D (−i) .
For a transformation T : Ω → Ω, where Ω ⊂ R p , we denote its Jacobian matrix J T = ∇T = ∂ α T β α,β and the determinant of the Jacobian matrix J T = |J T |.The gradient operator ∇ operating on a function µ : R p → R is assumed to yield a column vector, the Hessian matrix for a function µ is denoted ∇∇µ, and the Laplacian of µ is denoted ∇ 2 µ The operator | • | refers to determinants when the argument is a matrix, the 2-norm when the argument is a vector, and the absolute value when the argument is a scalar.

IMPORTANCE SAMPLING-BASED APPROXIMATE LEAVE ONE OUT CROSS VALIDATION (IS-LOO)
Suppose that one has pre-trained a Bayesian classification model such that one is able to sample its posterior parameters θ s iid ∼ π(θ|D).Our objective is to use knowledge of this full-data posterior distribution to estimate how the model would behave if any single point is left out at training.One can relate the full-data model to the model with observation i left out using the Bayesian update equation which is a Fredholm second-kind integral equation with respect to π(θ|D (−i) ).This integral equation is in-practice difficult to solve due to the typically-high dimensionality of θ.
Rather than attempt direct inversion of Eq. 3, our starting point is the observation that π(θ|D which provides the ratio of densities between a distribution we know (the full-data posterior π(θ|D)) and a distribution whose statistics we would like to compute (the point-wise LOO posterior π(θ|D (−i) )).To use the former in order to compute statistics of the latter we turn to importance sampling, noting that for an integrable function f, Sampling θ k iid ∼ π(θ|D), we can approximate Eq. 5 using the Monte-Carlo integral where the coefficients ν ik are known as the self-normalized importance sampling weights so that the undetermined constant E θ|D (−i) [ℓ(θ|x i , y i )] cancels out.Eqs. 6, 7 define a well-known [Gelfand et al., 1992] Monte-Carlo estimator for LOO.

LOO-BASED METRICS
In Bayesian workflows, it is common to evaluate model generalizability by using importance sampling LOO to compute a Bayesian analogue to the Aikaike Information Criterion (AIC).The LOO information criterion (LOO-IC) takes the form Alternatively, for classification problems, one often desires other metrics such as an estimate of the out-of-sample area under the receiver operator curve or precision-recall curve.
To compute these quantities, one simply propagates LOO estimates of the outcome probabilities into the relevant formulae.For instance, one may use the probabilistic interpretation of the AUROC [Wang and Guo, 2020] to motivate the direct estimator of the LOO AUROC, where D 0 is the set of all negative observations and D 1 is the set of all positive observations.

WEIGHT STABILIZATION
Often it is the case that using the computed posterior π(θ|D) as the proposal distribution for importance sampling has slow convergence properties -the 1/ℓ importance weights are known to have large or unbounded variance [Peruggia, 1997], making the importance sampler estimate for LOO noisy.
Two practical model agnostic methods for controlling the tail of the importance weights are through weight truncation [Ionides, 2008] and Pareto smoothing [Vehtari et al., 2015[Vehtari et al., , 2017]].Pareto smoothing replaces the largest M weights with their corresponding rank-values from a fitted generalized Pareto-distribution [Zhang and Stephens, 2009].Pareto smoothed importance sample (PSIS)-based LOO implementations are widely available in software packages such as Stan and ArviZ.However, PSIS-LOO is insufficient in problems where the Pareto distribution does not well-fit the tail distribution of importance weights -where the estimated Pareto shape parameter k exceeds 0.7 -this parameter is a useful criteria for accessing the validity of a given importance sampler.In such a case, it would be desirable to perform an additional model-specific controlled transformation on the proposal distribution that will induce more efficient computations.

ADAPTIVE IMPORTANCE SAMPLING
The objective of adaptive importance sampling [Bugallo et al., 2017, Cornuet et al., 2011, Elvira and Martino, 2022] in the context of LOO is to use a transformation to nudge, independently for each observation i, the posterior distribution closer to the LOO distribution π(θ|D (−i) ) (relationships between the different distributions is depicted in Fig. 1).First One wants to sample from π(θ|D (−i) ), the LOO distribution for observation i, by sampling from the full-data posterior π(θ|D).The transformation T i on the full-data posterior brings the sampling distribution closer to the target LOO distribution.
we'll examine importance sampling under an arbitrary bijective transformation.
Consider the bijection T i : R p → R p , defined for observation i, and let ϕ Ti (ϕ), the determinants of the Jacobians of the inverse and forward transformations respectively.One can rewrite the expectation in Eq. 5 in terms of an integral over π ϕ , One can then define a Monte-Carlo approximation of Eq. 11 using importance sampling, by sampling By Bayes rule (Eq.3), the posterior likelihood ratio in Eqs.11-13 has the exact expression Computing this expression requires iterating over the entire dataset.For large datasets, one can turn to variational approximations.

ADAPTIVE IMPORTANCE SAMPLING USING VARIATIONAL POSTERIORS
For computational expediency, variational methods are often used in place of MCMC for Bayesian inference, obtaining a variational approximation π(θ|D) to the true posterior, where π lies within a given family of transformed probability distributions.In problems where one expects a substantial discrepancy between the true posterior and π, one may correct for this discrepancy by noting that and using the self-normalized importance sampler where π(ϕ k ) is the prior density at ϕ k , canceling out the two unknown constants corresponding to π(ϕ k |D) and ν i .

METHODS
Eq. 13 is valid for an arbitrary bijection T i .The objective in using transformations is to shift the proposal distribution closer to the targeted LOO distribution for each observation -to partially invert Bayes rule.First, we motivate three different single-step transformations of the form for some small step-size h and function Q i .

SINGLE STEP TRANSFORMATIONS
KL divergence descent: We consider choosing T i to minimize the KL divergence D KL π(θ|D (−i) )∥π ϕ (θ|D) , which is equivalent to minimizing the cross-entropy with respect to the mapping T i , H π(θ|D The Euler-Lagrange equation for minimizing Eq. 19 (derived in Supplemental Materials S.1.1),is implicit in T i .
While it admits no closed form solution, one may note that T i is a t → ∞ stable fixed point of the KL-descending gradient flow and use this fact to refine, using the method of lines, an initial guess of , to arrive at the transformation Variance descent: In importance sampling, the variance of the estimator is conditional on the target function for expectation.Since we are interested in computing the LOO predictive probability for each observation i, it is natural to consider minimizing the variance of the transformed importance sampler for the function p i (θ) = p(θ, x i ).However, this objective yields a transformation that is only useful for observations where y i = 0 (see Supplemental MaterialsS.1.2).Instead, we seek to minimize the variance with respect to estimating the complement probability Starting from the associated variational problem (Appendix S.1.2),and applying the same rationale that went into developing the KL-descending transformation, one arrives at the single step variance-reducing transformation, Log-likelihood descent: The two single step transformations of Eq.21 require computing both the gradient and Hessian of the trained full-data posterior.We also consider a simpler alternate transformation [Elvira et al., 2023] equivalent to repelling each parameter sample a step h > 0 away from the maximization of the log likelihood function for point i.

RESOLVING THE POSTERIOR DENSITY
Both the KL (Eq.21) and variance (Eq.22) descent transformations take steps proportional to the posterior density π(θ|D).If a variational approximation for π(θ|D) is available, using it in Eqs.21 and 22 as a stand-in for the posterior density helps simplify the computation of the transformations and their Jacobians, particularly when using mean-field or low-order ADVI.
In the absence of variational approximation, one may evaluate the posterior densities exactly using Bayes rule, absorbing the unknown normalization constant Z into the step size h.Then, the KL and variance descent transformations are and The obvious downside of using these exact transformations is the need to iterate over the entire dataset in order to evaluate the posterior density, which must be done for each parameter sample, for each data point.
For evaluating the Jacobian determinants, one appeals to Bayes rule to find that where Z is is absorbed into h.

STEP SIZE SELECTION
The KL-divergence and variance descent transformations correspond to a forward Euler solver on the respective gradient flow equations.According to linear stability analysis, Euler's method has the conditional stability criteria h < 2/max k |Re(λ k )| where λ k are the eigenvalues of the Jacobian of the system (Jacobians of the functions Q i ).
In each case the structure of the Jacobian admits cheap approximations of λ k .However, for nonlinear systems, this criterion is not sufficient for achieving stability.
Instead, we use a modified rule to determine the step size.For all parameter samples at each individual observation i, we use where ρ > 0 and Σ α,α is the marginal posterior standard deviation of the α-th component of θ.This rule ensures that the transformation takes a step of at most ρ posterior standard deviations in any parameter component.The objective of adaptation is to find any transformation that results in importance weights where the Pareto tail shape is subthreshold.To this end, one can compute the transformations for a range of ρ values in parallel using vectorized computations, saving computation at the cost of memory utilization.

JACOBIAN DETERMINANT APPROXIMATION
For any of the three single-step transformations, one may approximate |J Ti | by noting that and truncating to O(h), sidestepping the computation of Hessian matrices and their spectra.Note that any higher order terms in this expansion require characterization of the spectra of ∇Q i , for each observation i, and for each sampled parameter θ k .For large problems, computing the Jacobian matrix and its spectra this many times can become computationally problematic.

OVERVIEW
We have presented three single step transformations, each aimed at stabilizing a LOO importance sampler by bringing the proposal distribution closer to the LOO target in a different sense.The KL divergence and variance descent transformations correspond to the first-step of a forward Euler discretized solver for the corresponding gradient flow equations.The log-likelihood descent transformation repels the posterior parameter distribution away from the targeted LOO observation, undoing part of its contribution to the full data posterior.While each transformation uses gradient information, their Jacobians are simple to approximate, requiring no computation of full Hessian matrices.
Generally, one will find that many observations are amenable to direct importance sampling with 1/ℓ weights (Eq.5) in combination with Pareto smoothing (tail weight distribution shape parameter k < 0.7).One needs only transform the sampling distribution when the estimated shape parameter exceeds this threshold.For a given posterior sample of model parameters θ 1 , . . ., θ s iid ∼ π(θ|D), one undergoes the procedure for each given observation i: procedure AdaptiveIS(observation i) Compute weights ν ik (Eq.7) and their tail shape

then Done
It is important to note that if any transformation takes k for a given observation under the threshold then adaptation is successful.

EXAMPLES
In this manuscript we consider the broad widely-used class of models that have a sigmoidal parameterization where σ(µ) = 1/(1 + e −µ ) is the sigmoid function and we denote µ i (θ) ≡ µ(θ, x i ) for some mean function µ.
For these models, the steps for each of the three transformations take the form and their Jacobians take the form where and π(θ) is the prior.
Here we will consider two popular sub-families of sigmoidal models.

LOGISTIC REGRESSION (LR)
LR is a sigmoidal model where µ i (θ) = x ⊺ i β, So, ∇ β µ i = x i , and ∇∇µ = 0.Because the Hessian of µ vanishes, the Jacobian of the function Q i for each of the three functions is a rank-one matrix and has only a single non-zero eigenvalue.LR admits exact Jacobian determinants for each of the three transformations:

BAYESIAN (RELU) NEURAL NETWORKS
Bayesian ReLU-nets [Lee, 2000, Ghosh and Doshi-Velez, 2017, Choi et al., 2018, Kristiadi et al., 2020, Bhadra et al., 2019] are piecewise linear [Sudjianto et al., 2021, Wang, 2022, Montúfar et al., 2014, Sudjianto et al., 2020] extensions to regression models.Being locally linear, these models have block-sparse Hessians and are also amenable to some limited degree of interpretability [Sudjianto et al., 2020, Chang et al., 2023].One may write an L-layer ReLU Bayesian neural network recursively where a is the ReLU activation function.The derivative of this function is the unit step function.We assume that the output function is sigmoid, noting that the softmax function also transforms into a sigmoid under a change of variables.Within the parameterization of Eq. 40 we absorbed the initial first-layer bias into the transformation W 1 , by assuming that x has a unit constant component, as is the convention in regression.
The Hessian matrix of µ, while non-zero, is sparse because all of the following identities hold: For this reason, the Jacobian determinant approximation of Eq. 28 can ignore the model Hessian entirely.However, in the case of one hidden layer we exploit the Hessian's structure to provide explicit exact expressions for J (•) .
Example 4.1 (One hidden layer).These models are governed by the equations µ = W 2 a(z 1 ) + b 2 and z The only nonzero components of the Hessian matrix for µ are the mixed partial derivatives The Hessian matrix of µ has a particular block structure that can be exploited (see Supplemental Materials S.2.1.1 for derivations) in order to find explicit expressions for its 2d non-zero eigenvalues, for k ∈ {1, 2 . . ., d}, and associated eigenvectors where and u k = a ′ ((z 1 ) k )x.To compute the overall transformation Jacobians, one can then apply rank-one updates to ∇∇µ -a process that is aided by projecting the model gradients into the eigenspace of the model Hessian (see S.2.1 for derivations).

IMPLEMENTATION
We implemented our methodology in Python using the github:mederrata/bayesianquilts wrapper for Tensorflow probability [Dillon et al., 2017].For Pareto smoothing, we used the Python implementation available at github:avehtari/PSIS.

DATASET AND MODEL
For demonstration, we used a public domain ovarian cancer micro-array dataset Hernández-Lobato et al. [2010], Schummer et al. [1999].This dataset consists of n = 54 observations of p = 1056 + 1 predictors.As an example of a p ≫ n problem, model-agnostic 1/ℓ importance sampling is insufficient for computing LOO expectations.Notably, Paananen et al. [2021] used this dataset to test their moment-matching adaptive importance sample where they successfully decreased the number of observations where k > 0.7 from between 34 and 36 to between 11 and 20, depending on the number of posterior samples.We reproduced their logistic regression model, using the same regularized-horseshoe [Piironen andVehtari, 2017b,c, Carvalho et al., 2009] prior, and the same statistical inference scheme using Stan, which we interfaced to Python using the package cmdstanpy.
We ran four parallel Markov Chains, with twelve thousand burn-in iterations, retaining 2000 samples per chain after thinning to every other sample.
Prior to any adaptation, we estimated k for the corresponding 1/ℓ tail weights for this model.Repeatedly simulating 64 posterior draws, we found 27 ± 4 observations having k in excess of the stability cutoff.

DECREASING k
We scanned different values of ρ = 10 −r , for r ∈ {0, 1, . . ., 6}, evaluating all three transformations for a given value of ρ.We found that for 21 ± 4 of the initially k > 0.7 observations that at least one of the transformations reduced from super-threshold to sub-threshold in k.Fig. 2 shows the results of one such simulated posterior draw.Generally, for too-large a value of ρ, all k values increased.

AREA UNDER ROC/PRC
Since n ≪ p, logistic regression is easily able to perfectly fit the dataset as defined by achieving in-training AUROC/AUPRC scores of approximately one.
For the model trained using MCMC, the LOO AUROC and AUPRC curves have areas slightly less than one (Fig. 3a).In contrast, the model, when inferred using mean-field ADVI, has lower LOO AUROC and AUPRC scores.This finding is likely due to the fact that the MCMC model, when using the full posterior, captures associations between the model parameters -in effect finding a lower-dimensional "factorization" of the features.The mean-field model is unable to capture parameter correlations by design and its LOO performance suffers.

DISCUSSION
In this manuscript we introduced an adaptive importance sampler for using pre-trained full-data posteriors to approximate leave one out cross validation (LOO) in Bayesian classification models.The objective of adaptation to bring the sampling distribution (the full data posterior) closer to the target LOO posterior distributions for each data point.
The methodology is based on taking a single step according to the gradient flow corresponding to minimization of a given objective.We introduce two such transformations: KL divergence descent and variance descent, along with a simple log-likelihood descent transformation.We presented explicit formulae for these transformations for logistic regression and ReLU-activated artificial neural networks with one hidden layer -the latter by computing the exact spectral decomposition of its Hessian matrix.We described how one can easily approximate the Jacobian of the transformations for more-complicated models, including for ReLU neural networks of any size.The adaptive importance sampler is ultimately used to estimate the expected LOO prediction for each given datapoint -quantities that can be used to compute downstream model generalization metrics such as ROC/PRC curves and the area under these curves.

LIMITATIONS
The main tradeoff of this method versus the model-agnostic PSIS-LOO method is that this method is model-dependent.In order to use this methodology for a given model, one needs to be able to evaluate gradients of the model with respect to parameters -and also the gradients of the corresponding prior distribution.Both the KL descent and variance descent transformations require computing the the posterior density -when a variational approximation of the posterior is not available or trustworthy this computation is costly for large datasets.

EXTENSIONS
In this manuscript we focused on classification problems but the methodology for adapting the importance sampler is much broader.In the Supplemental Materials one may find more-general formulae for the KL and variance descending transformations.In medical and industrial contexts, one is often interested in whether an individual or unit will experience an outcome within a certain time interval.For instance, policymakers are interested in hospital readmission within 30 days post discharge [Xia et al., 2023, Chang et al., 2023] because these readmissions are possibly preventable.In these problems, one may apply survival modeling to characterize the lifetime distribution, and additionally evaluate a model according to its classification performance at a given cut-off time T .Our methodology can easily be used for assessing such models.
Another extension to this methodology is to take more steps along the gradient flow for a given objective.Finally, we resolved the spectrum of the Hessian matrix for shallow ReLU models -the spectral decomposition for larger ReLU models may be useful for other analyses.
theory and edited the paper.HRY and JP edited the paper.

S.2 SIGMOIDAL MODELS
For these models one may use the chain rule to write the gradient, and Hessian ∇∇ℓ(θ|x, y) = ℓ(θ|x, y)∇∇ log ℓ(θ|x, y) for the likelihood function as a function of the gradient and Hessian of µ.
For the sake of space, we drop all indices i and let π = π(θ|x) and ℓ = ℓ(θ|x, y) All three transformations are of the form T (θ) = θ + Q(θ)

Figure 1 :
Figure 1: Relationships between probability densities.One wants to sample from π(θ|D(−i) ), the LOO distribution for observation i, by sampling from the full-data posterior π(θ|D).The transformation T i on the full-data posterior brings the sampling distribution closer to the target LOO distribution.

Figure 2 :
Figure 2: Applying transformations (blue=KL descent, green=variance descent, purple=LL descent) for different values of ρ (point labels are − log 10 ρ) to posterior samples of a logistic regression model fitted to the ovarian cancer dataset and evaluating k for those observations where the untransformed 1/ℓ IS weights (Eq.7) have k > 0.7.

Figure 3 :
Figure 3: LOO ROC curves for ovarian cancer classification model contrasting the model fitted using MCMC and the model fitted using mean-field ADVI.